Probability distribution for Bernoulli process:
\[
P(k | \theta) = \theta^k (1 - \theta)^{1 - k}
\]
where
\(k = 1\) indicates success and \(k = 0\) indicates failure; and
the probability of success on any trial is \(\theta\) .
Probability distribution for binomial process:
\[
P(k | n, \theta) = \binom{n}{k} \theta^k (1 - \theta)^{n - k}
\]
where
\(k\) is the number of successes in \(n\) trials; and
the probability of success on any trial is \(\theta\).
For session \(i\) there were \(k_i\) hits from \(n_i\) spins.
At the session level the relationship between spins and hits can be modelled as a binomial process. For each session we are considering the number of spins (or “trials”) denoted by \(n\) and the number of hits (or “successes”) denoted by \(k\) . And \(\theta\) is the probability of success on any spin.
The binomial distribution allows us to calculate the probability of \(k\) given specific values for \(n\) and \(\theta\) . So this assumes that you know \(\theta\) … But what if you don’t? Certainly for most slot machines you don’t know the probability of winning. If you did then you probably wouldn’t be very tempted to play.
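As a quick numerical check, both distributions can be evaluated directly. A Python sketch with an illustrative \(\theta = 0.3\) (the true hit rate is of course unknown):

```python
from math import comb

def bernoulli_pmf(k, theta):
    # P(k | theta) = theta^k * (1 - theta)^(1 - k), with k in {0, 1}
    return theta**k * (1 - theta)**(1 - k)

def binomial_pmf(k, n, theta):
    # P(k | n, theta) = C(n, k) * theta^k * (1 - theta)^(n - k)
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

print(bernoulli_pmf(1, 0.3))     # probability of a hit on a single spin
print(binomial_pmf(3, 10, 0.3))  # probability of 3 hits in 10 spins
```

The Bernoulli distribution is just the \(n = 1\) special case of the binomial.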
Suppose we had observations of \(k\) and \(n\) , and wanted to estimate \(\theta\) .
Bayesian Inference
Bayes’ Theorem \[
p(\theta|y, X) = \frac{p(y|X, \theta) p(\theta)}{p(y)} \propto p(y|X, \theta) p(\theta)
\] where
\(y\) are observations;
\(X\) are predictors;
likelihood — \(p(y|X, \theta)\) is probability of data given parameters;
prior — \(p(\theta)\) is parameter distribution before data; and
posterior — \(p(\theta|y, X)\) is parameter distribution given data.
Bayes’ Theorem is probably familiar to many of you, but let’s just take a moment to review.
The prior specifies what we know about the model parameters before considering the data.
The likelihood gives the probability of the data conditional on the model and choice of parameter.
The posterior is what we know about the model parameters after factoring in the data.
A powerful feature is that this can be applied iteratively, with the posterior from one iteration becoming the prior in the following iteration.
Despite the apparent simplicity of this relationship it has historically been rather difficult to apply due to computational challenges.
Note:
The denominator is sometimes called the “Evidence”. It’s just the marginal distribution of \(y\) .
So much for the theory. In practice, analytical expressions for the posterior are rare.
Confounding features:
data are often multi-dimensional;
models have multiple parameters.
So evaluating \(p(\theta|y, X)\) becomes challenging!
One approach to applying Bayes’ Theorem is to use a regular grid of parameter values. This works well and you can see that the shape of the posterior (as represented by points on the grid) evolves with each iteration.
But there’s a major problem with the grid approximation: it scales poorly with the number of model parameters and rapidly becomes intractable.
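The grid idea can be sketched in a few lines. This Python sketch uses hypothetical session data (the actual spin data are not reproduced here), updating the grid posterior session by session so that each posterior becomes the next prior:

```python
import numpy as np

# Hypothetical session data as (hits, spins) pairs -- illustrative values only.
sessions = [(28, 100), (61, 200), (47, 150)]

theta = np.linspace(0.001, 0.999, 999)   # regular grid over the parameter
posterior = np.ones_like(theta)          # start from a flat prior

for k, n in sessions:
    # Binomial likelihood on the grid (the binomial coefficient is a
    # constant in theta and cancels on normalisation).
    likelihood = theta**k * (1 - theta)**(n - k)
    posterior *= likelihood
    posterior /= posterior.sum()         # renormalise: posterior -> next prior

# Posterior mean of theta after all sessions.
print((theta * posterior).sum())
```

With one parameter a grid of ~1000 points is cheap; with \(d\) parameters the same resolution costs \(1000^d\) points, which is why the approach collapses in higher dimensions.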
Monte Carlo (Plain Vanilla Version)
Generate independent samples \(\theta^{(m)}\), for \(m = 1, \ldots, M\).
Requires the ability to sample directly from \(p(\theta^{(m)} | y, X)\).
Monte Carlo is an alternative technique which generates independent, random samples from the parameter space.
Although it’s better in higher dimensions than the grid approximation, it’s still far from perfect.
However there’s yet another alternative…
Note:
The objective is to jump around parameter space in such a way that the probability of being at a point is proportional to the distribution.
Efficient algorithms exist for doing this with simple distributions. However, any non-trivial distribution will generally require you to apply something like rejection sampling. And to make that efficient we’d need to know the maximum value of the distribution.
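To make that last point concrete, here is a minimal Python sketch of rejection sampling from an unnormalised Beta(3, 7)-shaped density (chosen purely as an example). Note that the bound `F_MAX` on the density must be known up front; that is exactly the requirement that becomes awkward for non-trivial posteriors:

```python
import random

random.seed(42)

def f(theta):
    # Unnormalised Beta(3, 7)-shaped density on [0, 1].
    return theta**2 * (1 - theta)**6

F_MAX = f(0.25)  # density peaks at the mode, theta = (3-1)/(3+7-2) = 0.25

def rejection_sample():
    while True:
        theta = random.random()                 # propose uniformly on [0, 1)
        if random.random() * F_MAX < f(theta):
            return theta                        # accept in proportion to f

samples = [rejection_sample() for _ in range(20000)]
print(sum(samples) / len(samples))              # approx. Beta(3, 7) mean, 0.3
```

If `F_MAX` is unknown (or grossly overestimated) the acceptance rate plummets, which is what motivates MCMC.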
Markov Chain Monte Carlo (MCMC)
Generate a series of samples: \(\theta^{(1)}\) , \(\theta^{(2)}\) , \(\theta^{(3)}\) , … \(\theta^{(M)}\) .
\(\theta^{(m)}\) depends on \(\theta^{(m-1)}\) .
Requires \(p(\theta^{(m)} | \theta^{(m-1)}, y, X)\) .
Markov Chain Monte Carlo generates a sequence of samples where each new sample is random but also dependent on the previous sample.
We’re not going to stress about the details of Markov chains because we’ll be using a remarkable piece of software called Stan.
Stan is a high level language for writing statistical models.
Note:
It uses Hamiltonian Monte Carlo, which applies some general principles from Physics to provide much more efficient sampling. Yes, it takes a little longer to generate each sample, but those samples are guaranteed to explore parameter space much more effectively.
Hamiltonian Monte Carlo is much more efficient than vanilla MCMC or the Metropolis algorithm.
Stan Workflow
1. Choose a model.
2. Write Stan program (likelihood and priors).
3. Stan parser converts this to C++.
4. Compile C++.
5. Execute compiled binary.
6. Evaluate results. Optionally return to 2.
7. Inference based on posterior sample.
To use rstan you need two files: a .stan file (the model) and a .R file (the driver script).
Stan Skeleton
We’ll start by building a model to address Maud’s first burning question: what is the hit rate?
This is the basic structure of a Stan file. It consists of a number of blocks, not all of which are required.
The most important blocks are:
data which specifies the observations;
parameters which lists the parameters of the model; and
model which defines the likelihood and priors.
data {
int<lower = 0> N;
int<lower = 0, upper = 1> y[N];
}
parameters {
real<lower = 0, upper = 1> theta;
}
model {
y ~ bernoulli(theta);
}
Our first Stan model has two elements of data: the number of samples and the outcomes of the individual spins, encoded as a binary vector.
We know that theta must lie between 0 and 1.
We specify a Bernoulli likelihood.
We haven’t given a prior on theta, so by default Stan will use a uniform prior: any value of theta is equally likely.
library(rstan)

# Use all available cores.
options(mc.cores = parallel::detectCores())

trials <- list(
  N = nrow(spin),
  y = spin$success
)

fit <- stan(
  file = "bernoulli.stan",
  data = trials,
  chains = 2,    # Number of Markov chains
  warmup = 500,  # Warmup iterations per chain
  iter = 1000    # Total iterations per chain
)
This is the corresponding R code. Things to note:
it loads the rstan library;
data are transferred to Stan via a list (with names corresponding to those in the Stan file);
Stan is invoked via the stan() function; and
specify the number of chains as well as number of warmup and compute iterations per chain.
The “chains” are initially a little mysterious, but these are basically just independent series of samples. They aid better sampling of parameter space and also facilitate parallelism.
Note that there are 1000 iterations in total per chain, of which 500 are warmup, which leaves 500 active iterations. So, in total there are 1000 (2 times 500) active iterations.
Note:
Set the number of cores to be used for parallel execution (otherwise chains are run in series).
A traceplot shows the sampled values as a function of iteration number.
Let’s take a moment to see where those samples came from. We specified 2 chains, each with 500 warmup iterations and 1000 iterations in total, leaving 500 active iterations per chain. So that accounts for the 1000 samples in the stanfit object.
The chains are uncorrelated, but successive samples within each chain are clearly related to each other. Ideally we are aiming for good “mixing” across the domain of the distribution.
What’s happening during the warmup phase? The parameters of the MC are getting tuned to optimally explore the parameter space (one aspect of this is choosing the step size). By the time that the active iterations start the sampler should have more or less reached an equilibrium.
[1] "stanfit"
attr(,"package")
[1] "rstan"
samples <- as.data.frame(fit)
head(samples)
theta lp__
1 0.3326573 -1228.633
2 0.3136229 -1226.911
3 0.3276020 -1227.864
4 0.3203996 -1227.155
5 0.2942303 -1228.577
6 0.3166354 -1226.968
[1] 1000
We don’t actually get the posterior directly. The stanfit object is really just a collection of samples from the posterior.
Note:
We can access the samples using the extract() function, which yields a list. It’s normally easier to just convert these samples to a data frame.
The number of samples that we get is precisely the total number of active iterations across all chains.
data {
int<lower=0> N;
int hits[N];
int spins[N];
}
parameters {
real<lower=0,upper=1> theta;
}
model {
hits ~ binomial(spins, theta); // Likelihood
theta ~ beta(2, 2); // Prior
}
Let’s take a look at the session data. We need to treat these data as coming from a binomial distribution.
The data that we are providing consists of two vectors, hits and spins.
We are assuming a binomial relationship between hits and spins with a success probability denoted by the parameter theta.
As an experienced gambler Maud knows that the hit rate is not equally likely to lie anywhere between 0 and 1. We could have factored this knowledge into the computation by applying a more informative prior, for example a broad beta distribution peaked at 0.5. The results are essentially unchanged though.
Note:
The ~ represents a stochastic relationship. An = would indicate a deterministic relationship.
The beta distribution is the conjugate prior for the binomial likelihood (this means that the corresponding posterior will also be a beta distribution).
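That conjugacy can be verified without any sampling at all: a Beta(a, b) prior combined with \(k\) hits in \(n\) spins gives a Beta(a + k, b + n − k) posterior. A minimal Python sketch, using hypothetical totals for hits and spins:

```python
# Conjugate update: Beta(a, b) prior + binomial data -> Beta posterior.
a, b = 2, 2          # the beta(2, 2) prior from the Stan model
k, n = 132, 422      # hypothetical totals for hits and spins (illustrative)

a_post, b_post = a + k, b + n - k
posterior_mean = a_post / (a_post + b_post)
print(posterior_mean)   # close to the raw hit rate k / n
```

With this much data the prior contributes almost nothing, which is why the results barely change when we swap the uniform prior for beta(2, 2).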
print(fit, probs = c(0.025, 0.5, 0.975))
Inference for Stan model: binomial-beta-prior.
2 chains, each with iter=1000; warmup=500; thin=1;
post-warmup draws per chain=500, total post-warmup draws=1000.
mean se_mean sd 2.5% 50% 97.5% n_eff Rhat
theta 0.31 0.00 0.01 0.29 0.31 0.33 359 1
lp__ -1228.99 0.04 0.86 -1231.39 -1228.66 -1228.45 466 1
Samples were drawn using NUTS(diag_e) at Mon May 14 07:50:08 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
Use summary() to get information for each chain.
This is very neatly presented in the stanfit summary information.
There’s the mean value of theta as well as the lower and upper quantiles of the distribution. Don’t think of these as a confidence interval: they are quantiles of the posterior density. Close to the median they are stable, but further into the tails they can be noisy.
The average hit rate is in good agreement with our earlier result.
Maud is immediately more comfortable with this result because, even with a uniform prior, the success rate is constrained to a reasonable domain between 0 and 1.
n_eff is the effective sample size. This takes into account that the samples are not independent (this is a consequence of correlation in the Markov Chain). This is a vital component of the Monte Carlo version of the Central Limit Theorem.
We can see the degree of correlation in a simple autocorrelation plot.
The samples for success probability look like they have a Normal distribution and are centred on the value that we expected from earlier analysis, around 30%.
The nature of MCMC is that successive samples of the parameters are correlated. As a result the samples cannot be considered as independent and so we need to adjust the sample count accordingly.
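A crude effective sample size can be computed directly from the autocorrelations, via the standard \(M / (1 + 2\sum_{\text{lag}} \rho_{\text{lag}})\) estimate. The Python sketch below uses an AR(1) series as a stand-in for a correlated chain (it is not output from Stan):

```python
import numpy as np

np.random.seed(1)

# AR(1) series as a stand-in for correlated MCMC draws (illustrative only).
rho, M = 0.7, 20000
x = np.empty(M)
x[0] = 0.0
for m in range(1, M):
    x[m] = rho * x[m - 1] + np.random.normal()

def ess(chain, max_lag=100):
    # Effective sample size: M / (1 + 2 * sum of autocorrelations).
    c = chain - chain.mean()
    var = (c * c).mean()
    rho_sum = sum((c[:-lag] * c[lag:]).mean() / var
                  for lag in range(1, max_lag))
    return len(chain) / (1 + 2 * rho_sum)

print(ess(x))  # far fewer effectively independent draws than M = 20000
```

For an AR(1) chain the theoretical value is \(M (1 - \rho) / (1 + \rho)\), roughly 3500 here: twenty thousand correlated draws carry about as much information as a few thousand independent ones.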
Let’s move on to Maud’s second burning question: quantifying the RTP.
Q3: Hit Rate per Combination
Pay Table: Cat and Mouse

| description | symbols | payout |
|-------------|---------|--------|
| no win      |         | 0      |
| 1x mouse    | 🐭      | 1      |
| 1x cat      | 🐱      | 2      |
| 2x mouse    | 🐭🐭    | 5      |
| 2x cat      | 🐱🐱    | 10     |
| 3x mouse    | 🐭🐭🐭  | 20     |
| 3x cat      | 🐱🐱🐱  | 100    |
data {
int<lower=0> N;
real rtp[N];
}
parameters {
real<lower=0, upper=1> theta[6];
real<lower=0> sigma;
}
transformed parameters {
real<lower=0> pay;
pay = theta[1] * 1 + // Payline 1 -> 1x
theta[2] * 2 + // Payline 2 -> 2x
theta[3] * 5 + // Payline 3 -> 5x
theta[4] * 10 + // Payline 4 -> 10x
theta[5] * 20 + // Payline 5 -> 20x
theta[6] * 100; // Payline 6 -> 100x
}
model {
rtp ~ lognormal(log(pay) - sigma * sigma / 2, sigma);
theta[1] ~ beta(2, 2); // Mode = 0.5
theta[2] ~ beta(2, 4); // Mode = 0.25
theta[3] ~ beta(2, 5); // Mode = 0.2
theta[4] ~ beta(2, 8); // Mode = 0.125
theta[5] ~ beta(2, 10); // Mode = 0.1
theta[6] ~ beta(2, 16); // Mode = 0.0625
}
The priors are based on Maud’s gut feel for the relative frequency of each of the payouts. They carry a little information about the probabilities of each payline, but not a lot.
The importance of the warmup iterations is readily apparent here: the first few samples for some of the parameters are very far from the core of their distributions. This could be improved by providing more restrictive priors.
mean se_mean sd 2.5% 97.5% n_eff Rhat
theta[1] 0.140889558 1.453005e-03 0.091252684 0.0176434998 0.356934146 3944.185 0.9997016
theta[2] 0.067440693 6.854080e-04 0.043349006 0.0087506512 0.171524124 4000.000 0.9997912
theta[3] 0.028990953 2.894410e-04 0.018305855 0.0040694145 0.072763086 4000.000 0.9998409
theta[4] 0.014376898 1.439473e-04 0.009104029 0.0019740490 0.036289607 4000.000 1.0003147
theta[5] 0.007360153 7.254394e-05 0.004588082 0.0009349501 0.018455450 4000.000 1.0001251
theta[6] 0.001508806 1.453327e-05 0.000919165 0.0002091751 0.003683408 4000.000 0.9993676
The 1x payout is triggered on average every 7 spins.
The 100x payout is triggered on average only every 663 spins.
These are the vital statistics for determining the characteristics of the game. So we have effectively reverse engineered it.
Metropolis-Hastings Algorithm
Randomly sample \(\theta^{(0)}\) and set \(i = 1\). Then:

1. Randomly sample a proposal \(\theta^{*}\) in the vicinity of \(\theta^{(i-1)}\).
2. Sample \(u\) uniformly on \([0, 1)\).
3. Accept or reject:
\[
\theta^{(i)} = \left\{\begin{array}{ll}
\theta^{*} & \text{if } u \cdot p(\theta^{(i-1)}|y, X) < p(\theta^{*}|y, X) \\
\theta^{(i-1)} & \text{otherwise.}
\end{array}\right.
\]
4. Increment \(i\). Return to 1.
The Metropolis-Hastings Algorithm is the classical implementation of MCMC.
The algorithm proceeds by sampling a proposal point in the vicinity of the current point and then either accepting or rejecting that point depending on the ratio of the probability density at the two points. If the probability of the new point is higher then we always accept. If it’s lower then we accept with a probability that is the ratio of the densities.
It might occur to you that the samples are now no longer independent. And you’d be completely correct. We’ll see how this is accounted for.
Metropolis Monte Carlo is better than the Grid Approximation. However in an absolute sense it’s not really all that good. Especially not in higher dimensions, where it can take a very long time to sample the parameter space.
Note:
The size of the “vicinity” (effectively the step size) determines the acceptance rate of new points. If it’s too large then few new points will be accepted. If it’s too small then the rate of diffusion will be painfully slow.
Without step 3 this would essentially give us diffusion, a random walk. But this step, the Metropolis Acceptance Procedure, ensures that sampling converges to the underlying distribution.
The principal advantage of MCMC is that we don’t need to know the maximum value of the posterior.
Also it only depends on the ratio of posteriors, so these do not need to be normalised.
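Both properties are easy to see in code. Below is a minimal Metropolis sampler (Python sketch) for the hit-rate posterior under a flat prior; the hit and spin totals are illustrative, not Maud’s actual data. Only differences of unnormalised log posteriors are ever used, so neither the normalising constant nor the maximum is needed:

```python
import math
import random

random.seed(0)

# Metropolis sampler for p(theta | k, n) with a flat prior on [0, 1].
k, n = 132, 422   # hypothetical totals for hits and spins (illustrative)

def log_post(theta):
    if not 0 < theta < 1:
        return -math.inf            # outside the support: always reject
    # Unnormalised binomial log posterior (flat prior, coefficient dropped).
    return k * math.log(theta) + (n - k) * math.log(1 - theta)

theta, samples = 0.5, []
for _ in range(20000):
    proposal = theta + random.gauss(0, 0.05)   # step in the "vicinity"
    # Metropolis acceptance: accept with probability min(1, ratio).
    if math.log(random.random()) < log_post(proposal) - log_post(theta):
        theta = proposal
    samples.append(theta)

burned = samples[500:]              # discard warmup
print(sum(burned) / len(burned))    # should settle close to k / n
```

The Gaussian step size (0.05 here) plays exactly the role described in the note above: too large and almost nothing is accepted, too small and the chain diffuses painfully slowly.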
Posterior Predictive Distribution
\[
p(\tilde{y}|y) = \int_\theta p(\tilde{y}|\theta) p(\theta|y) d\theta
\]
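In practice the integral is approximated with posterior samples: draw \(\theta^{(m)}\) from the posterior, then draw \(\tilde{y}\) from the likelihood, so that \(p(\tilde{y}|y) \approx \frac{1}{M} \sum_m p(\tilde{y}|\theta^{(m)})\). A Python sketch, where the Beta draws are a hypothetical stand-in for a stanfit posterior sample:

```python
import numpy as np

rng = np.random.default_rng(7)

# Stand-in for posterior draws of the hit rate (hypothetical Beta posterior).
theta_draws = rng.beta(134, 292, size=4000)

# Posterior predictive for hits in a fresh session of 50 spins:
# for each posterior draw of theta, draw k ~ Binomial(50, theta).
k_pred = rng.binomial(50, theta_draws)

print(k_pred.mean())   # centred near 50 * E[theta]
```

The spread of `k_pred` reflects both the sampling noise of a 50-spin session and our residual uncertainty about \(\theta\).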
Poisson
trials <- list(
  N = nrow(sessions),
  spins = sessions$spins
)

fit <- stan(
  file = "poisson.stan",
  data = trials,
  chains = 4,
  warmup = 1000,
  iter = 2000
)

extract(fit)
hist(extract(fit)$lambda)
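The poisson.stan file itself is not shown here. A minimal sketch consistent with the data list above (N and spins) and the lambda parameter being extracted might look like this:

```stan
data {
  int<lower=0> N;
  int<lower=0> spins[N];   // spins per session
}
parameters {
  real<lower=0> lambda;    // mean number of spins per session
}
model {
  spins ~ poisson(lambda); // Likelihood (implicit uniform prior on lambda)
}
```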
Regression
ggplot(sessions, aes(spins, wager)) + geom_jitter()
Regression Stan
data {
int<lower=0> N;
vector[N] y; // Total wager
vector[N] x; // Number of spins (standardised)
}
parameters {
real alpha;
real beta;
real<lower=0> sigma;
}
model {
y ~ normal(alpha + beta * x, sigma); // Likelihood
//
alpha ~ normal(0, 10); // Prior (Normal)
beta ~ normal(0, 10); // Prior (Normal)
sigma ~ cauchy(0, 5); // Prior (Cauchy)
}
Because the number of spins has been standardised, the intercept will be the wager for the average number of spins.
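This is easy to verify numerically: with a standardised (zero-mean) covariate, the least-squares intercept equals the mean response. A Python sketch with simulated data (the variables and coefficients are illustrative, not the session data):

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated data: spins per session and a wager that grows with spins.
x_raw = rng.poisson(120, size=200).astype(float)   # spins (illustrative)
y = 0.5 * x_raw + rng.normal(0, 5, size=200)       # wager (illustrative)

x = (x_raw - x_raw.mean()) / x_raw.std()           # standardise the covariate
beta, alpha = np.polyfit(x, y, 1)                  # slope, intercept

print(alpha, y.mean())   # intercept matches the mean wager
```

Since the intercept of a least-squares fit is \(\bar{y} - \beta \bar{x}\), it reduces to \(\bar{y}\) whenever \(\bar{x} = 0\).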
Regression Stan R
In a linear model we are saying that the response is normally distributed with a mean that depends on the covariates and a constant standard deviation; that is, we are assuming homogeneous noise.
The half-Cauchy is essentially a one-sided, heavy tailed version of the Normal distribution. This constrains the value of sigma to be positive but is not as restrictive as a Normal distribution.
… the theory of inverse probability is founded upon error and must be wholly rejected. Sir Ronald Fisher (1925)
Not everybody’s a fan of the Bayesian approach. Ronald Fisher had this to say about “inverse probability”, as Bayesian inference was known in his day.
Possibly one of the reasons for this skepticism is that historically it has been computationally challenging to actually apply these techniques. This is no longer true.
In general you can’t apply Bayesian techniques analytically.
\[
\text{hit rate} = \theta^* = \frac{\text{total hits}}{\text{total spins}} = \frac{\sum_i k_i}{\sum_i n_i}
\]
with(sessions, sum(hits) / sum(spins))
[1] 0.3128803
Well \(\theta\) is just the expected number of hits per spin. So we could lump all of the data together and treat it as a single massive experiment, taking the ratio of the total number of hits to the total number of spins across all sessions.
This would certainly maximise the amount of data and should, in principle, give us a reasonably accurate result. But there is no way to quantify our uncertainty.
\[
\text{hit rate for session } i = \theta^*_i = \frac{k_i}{n_i}
\]
# A tibble: 1 x 2
session_avg session_std
<dbl> <dbl>
1 0.3100018 0.1030164
95% confidence interval extends from 28.9% to 33.1%.
Another approach is to treat each session as an independent experiment, each of which yields an estimate for the hit rate. Now we can calculate both mean and variance, and hence a confidence interval on the hit rate.
Maud is fairly happy with this result, but it does concern her that the confidence interval on the hit rate could, in principle, extend to values less than zero or greater than one. Neither of which is physically possible.
Assuming that sessions \(i = 1, 2, 3, \ldots\) are independent:
\[
P(k | n, \theta) = \prod_i P(k_i | n_i, \theta) = L(\theta; n, k)
\]
Let’s consider another approach: the joint probability distribution for all of the sessions lumped together. If we assume that they are independent, which is a reasonable assumption, then we can simply multiply together the distributions for each session. This gives us, on the one hand, a distribution for the number of hits in each of the sessions. But if we think of it as a function of \(\theta\) then it’s the likelihood of a particular value of \(\theta\) given observed values of \(k_i\) and \(n_i\).
If we maximise this function then we have the most likely value of \(\theta\) based on the data.
Note:
\(P(k | n, \theta)\) and \(L(\theta; n, k)\) are functionally equivalent but
\(P(k | n, \theta)\) is a function of \(k\) (and is a probability distribution) and
\(L(\theta; n, k)\) is a function of \(\theta\) (and is not a probability distribution).
Log-likelihood for binomial process (multiple trials):
\[
LL(\theta; n, k) = \sum_i k_i \log{\theta} + (n_i - k_i) \log{(1 - \theta)}
\]
However we’d be multiplying together a collection of small numbers, which would mean that we would rapidly lose precision. Instead we work with the log likelihood.
Note:
Neglect the binomial coefficient since it’s a constant (as a function of theta).
Because the logarithm is a strictly increasing function, the log likelihood is maximised in the same place as the likelihood.
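A minimal sketch of the grid search over \(\theta\) (Python, with hypothetical session data, so the log-likelihood values differ from those computed from the real sessions):

```python
import math

# Hypothetical session data as (hits, spins) pairs -- illustrative values only.
sessions = [(28, 100), (61, 200), (47, 150)]

def log_likelihood(theta):
    # LL(theta) = sum_i [ k_i log(theta) + (n_i - k_i) log(1 - theta) ],
    # with the binomial coefficient dropped (constant in theta).
    return sum(k * math.log(theta) + (n - k) * math.log(1 - theta)
               for k, n in sessions)

# Scan a grid of candidate values and pick the maximiser.
grid = [t / 1000 for t in range(1, 1000)]
mle = max(grid, key=log_likelihood)
print(mle)   # matches total hits / total spins
```

For the binomial likelihood the maximum can also be found analytically: it sits exactly at total hits divided by total spins, which is the pooled estimate from earlier.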
# A tibble: 6 x 2
theta log_likelihood
<dbl> <dbl>
1 0.31 -1225.411
2 0.32 -1225.604
3 0.30 -1226.146
4 0.33 -1226.692
5 0.29 -1227.843
6 0.34 -1228.649
If we plot the log-likelihood as a function of \(\theta\) then it’s a simple matter to identify the value at which it is maximised.
In the Maximum Likelihood approach we simply select the value of the parameter which gives the largest likelihood.
This gives us a point estimate for \(\theta\) with no indication of uncertainty. However, the likelihood is a tool which will allow us to get a much better characterisation of the hit rate.
Note:
The assumption implicit in this approach is that in the absence of data all parameter values are equally probable.
ShinyStan
library(shinystan)
launch_shinystan(fit)
Note:
Diagnose -> PPcheck “Posterior predictive check”: look at the distribution of observed data versus replications. Do they look similar? “If our model is a good fit then we should be able to use it to generate data that looks a lot like the data we observed.”